Variance Reduction Methods for Sublinear Reinforcement Learning
Authors
Abstract
This work considers the problem of provably optimal reinforcement learning for episodic finite-horizon MDPs, i.e., how an agent learns to maximize its long-term reward in an uncertain environment. The main contribution is a novel algorithm, Variance-reduced Upper Confidence Q-learning (vUCQ), which enjoys a regret bound of Õ(√(HSAT) + HSA), where T is the number of time steps the agent acts in the MDP, S is the number of states, A is the number of actions, and H is the episodic horizon. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve ε-average regret (for any constant ε) is O(SA), which is far fewer samples than are required to learn any non-trivial estimate of the transition model (the transition model is specified by O(S²A) parameters). The importance of sub-linear algorithms is largely the motivation for Q-learning and other "model-free" approaches. vUCQ also enjoys minimax-optimal regret in the long run, matching the Ω(√(HSAT)) lower bound. vUCQ is a successive-refinement method: the algorithm reduces the variance of its Q-value estimates and couples this estimation scheme with an upper-confidence-based exploration strategy. Technically, it is the coupling of these two techniques that yields both the sub-linear regret property and the asymptotically optimal regret.
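To make the upper-confidence ingredient concrete, here is a minimal Python sketch of optimistic tabular Q-learning for an episodic MDP. This is an illustration, not the authors' vUCQ: the bonus form, the learning rate, the constant c, and the random MDP are all assumptions, and vUCQ's variance-reduction step (re-estimating Q-values against a slowly updated reference value function) is deliberately omitted.

    # Sketch of upper-confidence tabular Q-learning (illustrative; not vUCQ itself).
    import numpy as np

    rng = np.random.default_rng(0)
    S, A, H, K = 5, 3, 4, 2000                       # states, actions, horizon, episodes
    P = rng.dirichlet(np.ones(S), size=(S, A))       # random transition model (assumed)
    R = rng.uniform(size=(S, A))                     # random rewards in [0, 1] (assumed)

    Q = np.full((H, S, A), float(H))                 # optimistic initialization
    N = np.zeros((H, S, A))                          # visit counts
    c = 1.0                                          # bonus scale (assumed hyperparameter)

    for k in range(K):
        s = 0                                        # fixed initial state (assumed)
        for h in range(H):
            a = int(np.argmax(Q[h, s]))              # act greedily w.r.t. optimistic Q
            s_next = rng.choice(S, p=P[s, a])
            N[h, s, a] += 1
            t = N[h, s, a]
            lr = (H + 1) / (H + t)                   # learning rate used in UCB-Q analyses
            bonus = c * np.sqrt(H**3 * np.log(S * A * H * K) / t)   # Hoeffding-style bonus
            v_next = 0.0 if h == H - 1 else min(H, Q[h + 1, s_next].max())
            target = R[s, a] + v_next + bonus
            Q[h, s, a] = (1 - lr) * Q[h, s, a] + lr * target
            s = s_next

Plain Q-learning would set the bonus to zero; the optimism term is what drives systematic exploration. vUCQ's contribution, as the abstract describes it, is to shrink the noise in the update target so that the confidence intervals, and hence the regret, can be made tighter.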
Similar resources
Deep Reinforcement Learning
In reinforcement learning (RL), stochastic environments can make learning a policy difficult due to high degrees of variance. As such, variance reduction methods have been investigated in other works, such as advantage estimation and control-variates estimation. Here, we propose to learn a separate reward estimator to train the value function, to help reduce variance caused by a noisy reward sig...
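Since the snippet cites control-variates estimation, a minimal sketch of that idea may help: a score-function (REINFORCE-style) gradient estimator keeps the same mean but loses variance when a baseline is subtracted. The one-parameter Bernoulli policy and the baseline values below are illustrative assumptions, not the cited paper's reward-estimator method.

    # Control-variate (baseline) variance reduction in a score-function estimator.
    import numpy as np

    rng = np.random.default_rng(1)
    theta = 0.3                                      # Bernoulli policy: P(a=1) = theta

    def grad_estimates(baseline, n=100_000):
        a = (rng.uniform(size=n) < theta).astype(float)
        r = a                                        # toy reward: 1 if a == 1, else 0
        score = a / theta - (1 - a) / (1 - theta)    # d/dtheta log pi(a)
        return score * (r - baseline)                # baseline leaves the mean unchanged

    for b in (0.0, 0.5):                             # b = 0: plain REINFORCE; b > 0: control variate
        g = grad_estimates(b)
        print(f"baseline={b}: mean={g.mean():.3f}, var={g.var():.3f}")

Both runs estimate the same gradient (mean ≈ 1 here), but the baselined estimator has markedly lower variance, which is the entire point of the technique.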
Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret
Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. However, current lifelong learning methods exhibit non-vanishing regret as the amount of experience increases, and include limitations that can lead to suboptimal or unsafe control...
Cold-Start Reinforcement Learning with Softmax Policy Gradient
Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity ...
Stochastic Variance Reduction for Policy Gradient Estimation
Recent advances in policy gradient methods and deep learning have demonstrated their applicability for complex reinforcement learning problems. However, the variance of the performance gradient estimates obtained from the simulation is often excessive, leading to poor sample efficiency. In this paper, we apply the stochastic variance reduced gradient descent (SVRG) technique [1] to model-free p...
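For reference, here is a minimal sketch of the SVRG estimator [1] itself, shown on a simple least-squares finite sum rather than the model-free policy-gradient setting of the snippet. The data, step size, and epoch length are assumptions; the cited work's adaptation to policy gradients additionally needs importance weighting because the sampling distribution changes with the policy.

    # SVRG: stochastic gradient steps corrected by a periodically refreshed full gradient.
    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 200, 5
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)

    def grad_i(w, i):
        return (X[i] @ w - y[i]) * X[i]              # per-example least-squares gradient

    w, lr = np.zeros(d), 0.01
    for epoch in range(20):
        w_snap = w.copy()
        mu = (X.T @ (X @ w_snap - y)) / n            # full gradient at the snapshot
        for _ in range(n):                           # inner loop of variance-reduced steps
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_snap, i) + mu    # unbiased; low variance near w_snap
            w -= lr * g
    print("relative residual:", np.linalg.norm(X @ w - y) / np.linalg.norm(y))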
Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines
Policy gradient methods have enjoyed great success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exacerbated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive a bias-free action-dependent baseline for variance reduction which fully exploits the structural form of the...
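The identity that makes such baselines bias-free can be stated in one line. For a policy that factorizes across m action dimensions, π_θ(a|s) = Π_i π_θ(a_i|s), a per-dimension baseline b_i depending on the state and the other dimensions a_{−i} leaves the gradient unbiased (notation assumed here, not copied from the paper):

    \nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_{i=1}^{m} \nabla_\theta \log \pi_\theta(a_i \mid s)\,\big(\hat{Q}(s,a) - b_i(s, a_{-i})\big)\Big]

Because a_i is sampled independently of a_{−i} given s, each baseline term has zero mean: E_{a_i}[∇_θ log π_θ(a_i|s)] = 0, so b_i(s, a_{−i}) factors out and contributes nothing to the expectation, however it is chosen.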
Journal: CoRR
Volume: abs/1802.09184
Pages: -
Publication year: 2018